INTERSPEECH.2017 - Speech Synthesis

Total: 20

#1 An RNN-Based Quantized F0 Model with Multi-Tier Feedback Links for Text-to-Speech Synthesis

Authors: Xin Wang ; Shinji Takaki ; Junichi Yamagishi

A recurrent-neural-network-based F0 model for text-to-speech (TTS) synthesis that generates F0 contours given textual features is proposed. In contrast to related F0 models, the proposed one is designed to learn the temporal correlation of F0 contours at multiple levels. The frame-level correlation is covered by feeding back the F0 output of the previous frame as an additional input to the current frame; meanwhile, the correlation over longer time spans is modeled similarly, but using F0 features aggregated over the phoneme and syllable. Another difference is that the output of the proposed model is not an interpolated continuous-valued F0 contour but rather a sequence of discrete symbols, including quantized F0 levels and a symbol for the unvoiced condition. By using the discrete F0 symbols, the proposed model avoids the influence of artificially interpolated F0 curves. Experiments demonstrated that the proposed F0 model, which was trained using a dropout strategy, generated smooth F0 contours with better perceived quality than those from baseline RNN models.
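
As an illustration of the input arrangement described above (previous-frame quantized F0 fed back alongside textual features and F0 statistics aggregated over the phoneme and syllable), here is a minimal PyTorch sketch. The layer sizes, number of quantization levels, and feature layout are assumptions for illustration, not the authors' configuration; the loop shows generation-style feedback, whereas training would feed back reference symbols with the dropout strategy mentioned in the abstract.

```python
# Minimal sketch (assumed sizes and feature layout) of an RNN F0 model whose
# current-frame input includes the previous frame's quantized F0 symbol and
# F0 statistics aggregated over the enclosing phoneme and syllable.
import torch
import torch.nn as nn

class QuantizedF0RNN(nn.Module):
    def __init__(self, text_dim=80, n_f0_classes=65, agg_dim=4, hidden=256):
        super().__init__()
        # n_f0_classes = quantized F0 levels + 1 symbol for "unvoiced"
        self.embed_prev = nn.Embedding(n_f0_classes, 32)
        self.rnn = nn.LSTMCell(text_dim + 32 + 2 * agg_dim, hidden)
        self.out = nn.Linear(hidden, n_f0_classes)

    def forward(self, text_feats, phone_agg, syl_agg):
        # text_feats: (T, text_dim); phone_agg, syl_agg: (T, agg_dim)
        T = text_feats.size(0)
        h = torch.zeros(1, self.rnn.hidden_size)
        c = torch.zeros(1, self.rnn.hidden_size)
        prev = torch.zeros(1, dtype=torch.long)      # start from the unvoiced symbol
        logits = []
        for t in range(T):
            x = torch.cat([text_feats[t:t + 1],
                           self.embed_prev(prev),
                           phone_agg[t:t + 1],
                           syl_agg[t:t + 1]], dim=-1)
            h, c = self.rnn(x, (h, c))
            logit = self.out(h)
            logits.append(logit)
            prev = logit.argmax(dim=-1)              # feed back the predicted symbol
        return torch.stack(logits, dim=0)            # (T, 1, n_f0_classes)
```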

#2 Phrase Break Prediction for Long-Form Reading TTS: Exploiting Text Structure Information

Authors: Viacheslav Klimkov ; Adam Nadolski ; Alexis Moinet ; Bartosz Putrycz ; Roberto Barra-Chicote ; Thomas Merritt ; Thomas Drugman

Phrasing structure is one of the most important factors in increasing the naturalness of text-to-speech (TTS) systems, in particular for long-form reading. Most existing TTS systems are optimized for isolated short sentences, and completely discard the larger context or structure of the text. This paper presents how we have built phrasing models based on data extracted from audiobooks. We investigate how various types of textual features can improve phrase break prediction: part-of-speech (POS), guess POS (GPOS), dependency tree features and word embeddings. These features are fed into a bidirectional LSTM or a CART baseline. The resulting systems are compared using both objective and subjective evaluations. Using BiLSTM and word embeddings proves to be beneficial.
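
A minimal PyTorch sketch of a BiLSTM phrase-break tagger over word-level feature vectors (e.g. word embeddings concatenated with POS or dependency features) is shown below; the feature dimension, hidden size, and binary break/no-break output are illustrative assumptions rather than the configuration used in the paper.

```python
# Sketch of a bidirectional LSTM phrase-break tagger over per-word features.
import torch
import torch.nn as nn

class PhraseBreakBiLSTM(nn.Module):
    def __init__(self, feat_dim=332, hidden=128, n_classes=2):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_classes)

    def forward(self, word_feats):
        # word_feats: (batch, n_words, feat_dim)
        out, _ = self.bilstm(word_feats)
        return self.proj(out)                # per-word break / no-break logits

model = PhraseBreakBiLSTM()
logits = model(torch.randn(1, 12, 332))      # one 12-word sentence
print(logits.shape)                          # torch.Size([1, 12, 2])
```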

#3 Physically Constrained Statistical F0 Prediction for Electrolaryngeal Speech Enhancement

Authors: Kou Tanaka ; Hirokazu Kameoka ; Tomoki Toda ; Satoshi Nakamura

Electrolaryngeal (EL) speech produced by a laryngectomee using an electrolarynx to mechanically generate artificial excitation sounds severely suffers from unnatural fundamental frequency (F0) patterns caused by monotonic excitation sounds. To address this issue, we have previously proposed EL speech enhancement systems using statistical F0 pattern prediction methods based on a Gaussian Mixture Model (GMM), making it possible to predict the underlying F0 pattern of EL speech from its spectral feature sequence. Our previous work revealed that the naturalness of the predicted F0 pattern can be improved by incorporating a physically based generative model of F0 patterns into the GMM-based statistical F0 prediction system within a Product-of-Experts framework. However, one drawback of this method is that it requires an iterative procedure to obtain a predicted F0 pattern, making it difficult to realize a real-time system. In this paper, we propose another approach to physically based statistical F0 pattern prediction using an HMM-GMM framework. This approach is noteworthy in that it allows us to generate an F0 pattern that is both statistically likely and physically natural without iterative procedures. Experimental results demonstrated that the proposed method was capable of generating F0 patterns more similar to those in normal speech than the conventional GMM-based method.

#4 DNN-SPACE: DNN-HMM-Based Generative Model of Voice F0 Contours for Statistical Phrase/Accent Command Estimation

Authors: Nobukatsu Hojo ; Yasuhito Ohsugi ; Yusuke Ijima ; Hirokazu Kameoka

This paper proposes a method to extract prosodic features from a speech signal by leveraging auxiliary linguistic information. A prosodic feature extractor called the statistical phrase/accent command estimation (SPACE) has recently been proposed. This extractor is based on a statistical model formulated as a stochastic counterpart of the Fujisaki model, a well-founded mathematical model representing the control mechanism of vocal fold vibration. The key idea of this approach is that a phrase/accent command pair sequence is modeled as an output sequence of a path-restricted hidden Markov model (HMM) so that estimating the state transition amounts to estimating the phrase/accent commands. Since the phrase and accent commands are related to linguistic information, we may expect to improve the command estimation accuracy by using them as auxiliary information for the inference. To model the relationship between the phrase/accent commands and linguistic information, we construct a deep neural network (DNN) that maps the linguistic feature vectors to the state posterior probabilities of the HMM. Thus, given a pitch contour and linguistic information, we can estimate phrase/accent commands via state decoding. We call this method “DNN-SPACE.” Experimental results revealed that using linguistic information was effective in improving the command estimation accuracy.
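
Since command estimation reduces to decoding a state sequence of a path-restricted HMM, with a DNN supplying state posteriors from linguistic features, a generic log-domain Viterbi decoder conveys the decoding step. The sketch below is a standard Viterbi routine, not the exact DNN-SPACE formulation; how the F0-based scores and DNN posteriors are combined into the per-frame emission scores is left abstract here.

```python
# Generic log-domain Viterbi decoding: given per-frame state scores (e.g. a sum
# of an F0-based log-likelihood and a DNN log-posterior) and a transition matrix
# encoding the path restrictions, recover the most likely state sequence.
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    # log_emit: (T, S) per-frame state scores; log_trans: (S, S); log_init: (S,)
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (S, S): from-state x to-state
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):                   # backtrace
        path[t] = back[t + 1, path[t + 1]]
    return path
```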

#5 Controlling Prominence Realisation in Parametric DNN-Based Speech Synthesis

Authors: Zofia Malisz ; Harald Berthelsen ; Jonas Beskow ; Joakim Gustafson

This work aims to improve text-to-speech synthesis for Wikipedia by advancing and implementing models of prosodic prominence. We propose a new system architecture with explicit prominence modeling and test the first component of the architecture. We automatically extract a phonetic feature related to prominence from the speech signal in the ARCTIC corpus. We then modify the label files and train an experimental TTS system based on the feature using Merlin, a statistical-parametric DNN-based engine. Test sentences with contrastive prominence at the word level are synthesised, and separate listening tests are conducted to evaluate a) the level of prominence control in the generated speech and b) naturalness. Our results show that the prominence feature-enhanced system successfully places prominence on the appropriate words and increases perceived naturalness relative to the baseline.

#6 Increasing Recall of Lengthening Detection via Semi-Automatic Classification

Authors: Simon Betz ; Jana Voße ; Sina Zarrieß ; Petra Wagner

Lengthening is the ideal hesitation strategy for synthetic speech and dialogue systems: it is unobtrusive and hard to notice, because it occurs frequently in everyday speech before phrase boundaries, in accentuation, and in hesitation. Despite its elusiveness, it allows valuable extra time for computing or information highlighting in incremental spoken dialogue systems. The elusiveness of the matter, however, poses a challenge for extracting lengthening instances from corpus data: we suspect a recall problem, as human annotators might not be able to consistently label lengthening instances. We address this issue by filtering corpus data for instances of lengthening, using a simple classification method based on a threshold for normalized phone duration. The output is then manually labeled for disfluency. This is compared to an existing, fully manual disfluency annotation, showing that recall is significantly higher with semi-automatic pre-classification. This shows that semi-automatic pre-selection is necessary to gather enough candidate data points for manual annotation and subsequent lengthening analyses; it is also desirable to further increase the performance of the automatic classification. We evaluate in detail human versus semi-automatic annotation and train another classifier on the resulting dataset to check the integrity of the disfluent – non-disfluent distinction.
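
The semi-automatic pre-classification described above is essentially a threshold on normalized phone duration. A minimal sketch follows, assuming per-phone-identity z-score normalization and an illustrative threshold; neither detail is taken from the paper.

```python
# Sketch: flag phones as lengthening candidates when their duration is unusually
# long for that phone identity (z-score above a chosen threshold). The threshold
# and the per-phone z-score normalization are illustrative assumptions.
from collections import defaultdict
import statistics

def lengthening_candidates(phones, threshold=2.0):
    # phones: list of (phone_label, duration_in_seconds)
    by_label = defaultdict(list)
    for label, dur in phones:
        by_label[label].append(dur)
    stats = {l: (statistics.mean(d), statistics.pstdev(d) or 1e-6)
             for l, d in by_label.items()}
    candidates = []
    for i, (label, dur) in enumerate(phones):
        mean, std = stats[label]
        z = (dur - mean) / std
        if z > threshold:
            candidates.append((i, label, z))   # hand these to manual annotation
    return candidates
```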

#7 Principles for Learning Controllable TTS from Annotated and Latent Variation

Authors: Gustav Eje Henter ; Jaime Lorenzo-Trueba ; Xin Wang ; Junichi Yamagishi

For building flexible and appealing high-quality speech synthesisers, it is desirable to be able to accommodate and reproduce fine variations in vocal expression present in natural speech. Synthesisers can enable control over such output properties by adding adjustable control parameters in parallel to their text input. If not annotated in the training data, the values of these control inputs can be optimised jointly with the model parameters. We describe how this established method can be seen as approximate maximum likelihood and MAP inference in a latent variable model. This puts previous ideas of (learned) synthesiser inputs such as sentence-level control vectors on a more solid theoretical footing. We furthermore extend the method by restricting the latent variables to orthogonal subspaces via a sparse prior. This enables us to learn dimensions of variation that are also present within classes in coarsely annotated speech. As an example, we train an LSTM-based TTS system to learn nuances in emotional expression from a speech database annotated with seven different acted emotions. Listening tests show that our proposal can successfully synthesise speech with discernible differences in expression within each emotion, without compromising the recognisability of synthesised emotions compared to an identical system without learned nuances.
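
In practice, jointly optimising unannotated control inputs looks like a learned per-utterance embedding trained together with the acoustic model. The sketch below shows only that joint-optimisation idea, with an L1 penalty standing in for a sparse prior; the model structure, sizes, and penalty weight are illustrative assumptions and not the paper's formulation.

```python
# Sketch: per-utterance control vectors stored in an embedding table and trained
# jointly with the acoustic model; an L1 penalty stands in for the sparse prior.
import torch
import torch.nn as nn

n_utts, ctrl_dim, text_dim, out_dim = 1000, 16, 300, 187
control = nn.Embedding(n_utts, ctrl_dim)           # one latent vector per utterance
acoustic = nn.LSTM(text_dim + ctrl_dim, 256, batch_first=True)
proj = nn.Linear(256, out_dim)
params = list(control.parameters()) + list(acoustic.parameters()) + list(proj.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def step(utt_id, text_feats, target):
    # utt_id: LongTensor of shape (1,); text_feats: (1, T, text_dim); target: (1, T, out_dim)
    c = control(utt_id).unsqueeze(1).expand(-1, text_feats.size(1), -1)
    h, _ = acoustic(torch.cat([text_feats, c], dim=-1))
    pred = proj(h)
    loss = nn.functional.mse_loss(pred, target) + 1e-4 * control.weight.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```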

#8 Sampling-Based Speech Parameter Generation Using Moment-Matching Networks

Authors: Shinnosuke Takamichi ; Tomoki Koriyama ; Hiroshi Saruwatari

This paper presents sampling-based speech parameter generation using moment-matching networks for Deep Neural Network (DNN)-based speech synthesis. Although people never produce exactly the same speech twice, even when expressing the same linguistic and para-linguistic information, typical statistical speech synthesis produces exactly the same speech every time, i.e., there is no inter-utterance variation in synthetic speech. To give synthetic speech natural inter-utterance variation, this paper builds DNN acoustic models that make it possible to randomly sample speech parameters. The DNNs are trained so that they make the moments of generated speech parameters close to those of natural speech parameters. Since the variation of speech parameters is compressed into a low-dimensional, simple prior noise vector, our algorithm has lower computation cost than direct sampling of speech parameters. As a first step towards generating synthetic speech with natural inter-utterance variation, this paper investigates whether or not the proposed sampling-based generation deteriorates synthetic speech quality. In the evaluation, we compare the speech quality of conventional maximum likelihood-based generation and the proposed sampling-based generation. The results demonstrate that the proposed generation causes no degradation in speech quality.
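
Moment-matching training of this kind is commonly implemented as a maximum mean discrepancy (MMD) criterion between generated and natural parameter vectors. The sketch below shows such a kernel MMD loss; the Gaussian kernel bandwidths and the biased estimator are illustrative choices, not necessarily those used by the authors.

```python
# Sketch of a moment-matching (MMD) criterion: make samples generated from noise
# match natural speech parameters in the sense of a kernel MMD.
import torch

def gaussian_kernel(x, y, sigmas=(1.0, 5.0, 10.0)):
    d2 = torch.cdist(x, y).pow(2)                    # pairwise squared distances
    return sum(torch.exp(-d2 / (2 * s * s)) for s in sigmas)

def mmd_loss(generated, natural):
    # generated, natural: (N, dim) speech-parameter vectors
    k_gg = gaussian_kernel(generated, generated).mean()
    k_nn = gaussian_kernel(natural, natural).mean()
    k_gn = gaussian_kernel(generated, natural).mean()
    return k_gg + k_nn - 2 * k_gn                    # biased MMD^2 estimate
```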

#9 Unit Selection with Hierarchical Cascaded Long Short Term Memory Bidirectional Recurrent Neural Nets

Authors: Vincent Pollet ; Enrico Zovato ; Sufian Irhimeh ; Pier Batzu

Bidirectional recurrent neural nets have demonstrated state-of-the-art performance for parametric speech synthesis. In this paper, we introduce a top-down application of recurrent neural net models to unit-selection synthesis. A hierarchical cascaded network graph predicts context phone duration, speech unit encoding and frame-level logF0 information that serve as targets for the unit search. The new approach is compared with an existing state-of-the-art hybrid system that uses Hidden Markov Models as the basis for the statistical unit search.

#10 Utterance Selection for Optimizing Intelligibility of TTS Voices Trained on ASR Data

Authors: Erica Cooper ; Xinyue Wang ; Alison Chang ; Yocheved Levitan ; Julia Hirschberg

This paper describes experiments in training HMM-based text-to-speech (TTS) voices on data collected for Automatic Speech Recognition (ASR) training. We compare a number of filtering techniques designed to identify the best utterances from a noisy, multi-speaker corpus for training voices, to exclude speech containing noise and to include speech close in nature to more traditionally-collected TTS corpora. We also evaluate the use of automatic speech recognizers for intelligibility assessment in comparison with crowdsourcing methods. While the goal of this work is to develop natural-sounding and intelligible TTS voices in Low Resource Languages (LRLs) rapidly and easily, without the expense of recording data specifically for this purpose, we focus on English initially to identify the best filtering techniques and evaluation methods. We find that, when a large amount of data is available, selecting from the corpus based on criteria such as standard deviation of f0, fast speaking rate, and hypo-articulation produces the most intelligible voices.
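
Corpus filtering of the kind described above can be expressed as simple percentile cut-offs on per-utterance statistics. The sketch below is illustrative only: the metadata fields, the percentile values, and the direction of each criterion (keeping the upper halves) are assumptions, not the selection rules reported in the paper.

```python
# Sketch of corpus filtering by simple acoustic criteria (standard deviation of
# F0, speaking rate). Field names and cut-offs are illustrative assumptions.
import numpy as np

def select_utterances(utts, f0_std_pct=50, rate_pct=50):
    # utts: list of dicts with keys "id", "f0_std", "phones_per_second"
    f0_cut = np.percentile([u["f0_std"] for u in utts], f0_std_pct)
    rate_cut = np.percentile([u["phones_per_second"] for u in utts], rate_pct)
    # Whether high or low values of each criterion are preferable is an
    # experimental question; keeping the upper halves is just an example.
    return [u["id"] for u in utts
            if u["f0_std"] >= f0_cut and u["phones_per_second"] >= rate_cut]
```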

#11 Bias and Statistical Significance in Evaluating Speech Synthesis with Mean Opinion Scores

Authors: Andrew Rosenberg ; Bhuvana Ramabhadran

Listening tests and Mean Opinion Scores (MOS) are the most commonly used techniques for the evaluation of speech synthesis quality and naturalness. These are invaluable in the assessment of subjective qualities of machine-generated stimuli. However, there are a number of challenges in understanding the MOS scores that come out of listening tests. Primarily, we advocate for the use of non-parametric statistical tests in the calculation of statistical significance when comparing listening test results. Additionally, based on the results of 46 legacy listening tests, we measure the impact of two sources of bias. Bias introduced by individual participants and synthesized text can have a dramatic impact on observed MOS scores. For example, we find that on average the mean difference between the highest and lowest scoring rater is over 2 MOS points (on a 5-point scale). From this observation, we caution against using any statistical test without adjusting for this bias, and provide specific non-parametric recommendations.
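
For the non-parametric testing the authors advocate, standard choices are the Wilcoxon signed-rank test for paired ratings and the Mann-Whitney U test for independent rating pools. A small SciPy example follows; the toy scores are made up for illustration.

```python
# Sketch: non-parametric significance tests for comparing two systems' MOS
# ratings, instead of assuming normally distributed scores.
from scipy.stats import wilcoxon, mannwhitneyu

def compare_mos(scores_a, scores_b, paired=True):
    # scores_a, scores_b: sequences of listener ratings (1-5) for systems A and B
    if paired:                                # same listeners/stimuli rated both
        stat, p = wilcoxon(scores_a, scores_b)
    else:                                     # independent rating pools
        stat, p = mannwhitneyu(scores_a, scores_b, alternative="two-sided")
    return stat, p

stat, p = compare_mos([4, 3, 5, 4, 4, 3, 5, 4], [3, 3, 4, 3, 4, 2, 4, 3])
print(f"p = {p:.3f}")
```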

#12 Phase Modeling Using Integrated Linear Prediction Residual for Statistical Parametric Speech Synthesis

Authors: Nagaraj Adiga ; S.R. Mahadeva Prasanna

Conventional statistical parametric speech synthesis (SPSS) focuses on characteristics of the magnitude spectrum of speech while ignoring its phase characteristics. In this work, the role of phase information in improving the naturalness of synthetic speech is explored. The phase characteristics of the excitation signal are estimated from the integrated linear prediction residual (ILPR) using an all-pass (AP) filter. The coefficients of the AP filter are estimated by minimizing an entropy-based objective function computed from the cosine phase of the analytic signal obtained from the ILPR. The resulting AP filter coefficients (APCs) are used as features for modeling phase in SPSS. At synthesis time, the frame-wise generated APCs are used to add group delay to the impulse excitation when generating the excitation signal. The proposed method is compared with the group-delay-based phase excitation used in the STRAIGHT method. Experimental results show that the proposed phase modeling yields better perceptual synthesis quality than the STRAIGHT method.

#13 Evaluation of a Silent Speech Interface Based on Magnetic Sensing and Deep Learning for a Phonetically Rich Vocabulary

Authors: Jose A. Gonzalez ; Lam A. Cheah ; Phil D. Green ; James M. Gilbert ; Stephen R. Ell ; Roger K. Moore ; Ed Holdsworth

To help people who have lost their voice following total laryngectomy, we present a speech restoration system that produces audible speech from articulator movement. The speech articulators are monitored by sensing changes in magnetic field caused by movements of small magnets attached to the lips and tongue. Then, articulator movement is mapped to a sequence of speech parameter vectors using a transformation learned from simultaneous recordings of speech and articulatory data. In this work, this transformation is performed using a type of recurrent neural network (RNN) with fixed latency, which is suitable for real-time processing. The system is evaluated on a phonetically-rich database with simultaneous recordings of speech and articulatory data made by non-impaired subjects. Experimental results show that our RNN-based mapping obtains more accurate speech reconstructions (evaluated using objective quality metrics and a listening test) than articulatory-to-acoustic mappings using Gaussian mixture models (GMMs) or deep neural networks (DNNs). Moreover, our fixed-latency RNN architecture provides comparable performance to an utterance-level batch mapping using bidirectional RNNs (BiRNNs).

#14 Predicting Head Pose from Speech with a Conditional Variational Autoencoder

Authors: David Greenwood ; Stephen Laycock ; Iain Matthews

Natural movement plays a significant role in realistic speech animation. Numerous studies have demonstrated the contribution visual cues make to the degree to which we, as human observers, find an animation acceptable. Rigid head motion is one visual mode that universally co-occurs with speech, so it is a reasonable strategy to seek a transformation from the speech mode to predict head pose. Several previous authors have shown that prediction is possible, but experiments are typically confined to rigidly produced dialogue. Natural, expressive, emotive and prosodic speech exhibits motion patterns that are far more difficult to predict, with considerable variation in expected head pose. Recently, Long Short Term Memory (LSTM) networks have become an important tool for modelling speech and natural language tasks. We employ Deep Bi-Directional LSTMs (BLSTM), capable of learning long-term structure in language, to model the relationship that speech has with rigid head motion. We then extend our model by conditioning on prior motion. Finally, we introduce a generative head motion model, conditioned on audio features using a Conditional Variational Autoencoder (CVAE). Each approach mitigates the problems of the one-to-many mapping that a speech-to-head-pose model must accommodate.
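
A minimal PyTorch sketch of a CVAE that conditions on audio features, in the spirit of the model described above, is given below. The layer sizes, latent dimension, and 6-DoF pose representation are assumptions for illustration, not the authors' architecture.

```python
# Minimal CVAE sketch: encode a head-pose window together with its audio-feature
# condition into a latent Gaussian, then decode pose from (latent, condition).
import torch
import torch.nn as nn

class PoseCVAE(nn.Module):
    def __init__(self, pose_dim=6, audio_dim=40, z_dim=8, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(pose_dim + audio_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + audio_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, pose_dim))

    def forward(self, pose, audio):
        h = self.enc(torch.cat([pose, audio], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize
        recon = self.dec(torch.cat([z, audio], dim=-1))
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl

    def sample(self, audio):
        # at synthesis time, draw z from the prior and condition only on audio
        z = torch.randn(audio.size(0), self.mu.out_features)
        return self.dec(torch.cat([z, audio], dim=-1))
```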

#15 Real-Time Reactive Speech Synthesis: Incorporating Interruptions

Authors: Mirjam Wester ; David A. Braude ; Blaise Potard ; Matthew P. Aylett ; Francesca Shaw

The ability to be interrupted and react in a realistic manner is a key requirement for interactive speech interfaces. While previous systems have long implemented techniques such as ‘barge in’ where speech output can be halted at word or phrase boundaries, less work has explored how to mimic human speech output responses to real-time events like interruptions which require a reaction from the system. Unlike previous work which has focused on incremental production, here we explore a novel re-planning approach. The proposed system is versatile and offers a large range of possible ways to react. A focus group was used to evaluate the approach, where participants interacted with a system reading out a text. The system would react to audio interruptions, either with no reactions, passive reactions, or active negative reactions (i.e. getting increasingly irritated). Participants preferred a reactive system.

#16 A Neural Parametric Singing Synthesizer

Authors: Merlijn Blaauw ; Jordi Bonada

We present a new model for singing synthesis based on a modified version of the WaveNet architecture. Instead of modeling raw waveform, we model features produced by a parametric vocoder that separates the influence of pitch and timbre. This allows conveniently modifying pitch to match any target melody, facilitates training on more modest dataset sizes, and significantly reduces training and generation times. Our model makes frame-wise predictions using mixture density outputs rather than categorical outputs in order to reduce the required parameter count. As we found overfitting to be an issue with the relatively small datasets used in our experiments, we propose a method to regularize the model and make the autoregressive generation process more robust to prediction errors. Using a simple multi-stream architecture, harmonic, aperiodic and voiced/unvoiced components can all be predicted in a coherent manner. We compare our method to existing parametric statistical and state-of-the-art concatenative methods using quantitative metrics and a listening test. While naive implementations of the autoregressive generation algorithm tend to be inefficient, using a smart algorithm we can greatly speed up the process and obtain a system that’s competitive in both speed and quality.
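
The frame-wise mixture density output mentioned above can be illustrated with a small PyTorch head that predicts mixture weights, means, and log-variances, trained with the mixture negative log-likelihood. The component count and dimensions below are assumptions, not the paper's settings.

```python
# Sketch of a mixture density output layer and its negative log-likelihood for
# diagonal-covariance Gaussian mixtures over a frame of vocoder parameters.
import math
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    def __init__(self, hidden=256, out_dim=60, n_mix=4):
        super().__init__()
        self.n_mix, self.out_dim = n_mix, out_dim
        self.proj = nn.Linear(hidden, n_mix * (1 + 2 * out_dim))

    def forward(self, h):
        p = self.proj(h).view(-1, self.n_mix, 1 + 2 * self.out_dim)
        log_w = torch.log_softmax(p[..., 0], dim=-1)          # mixture weights
        mu = p[..., 1:1 + self.out_dim]                       # component means
        log_var = p[..., 1 + self.out_dim:]                   # component log-variances
        return log_w, mu, log_var

def mdn_nll(log_w, mu, log_var, target):
    # target: (batch, out_dim) -> (batch, 1, out_dim) to broadcast over mixtures
    t = target.unsqueeze(1)
    log_prob = -0.5 * (((t - mu) ** 2) / log_var.exp() + log_var
                       + math.log(2 * math.pi)).sum(dim=-1)
    return -torch.logsumexp(log_w + log_prob, dim=-1).mean()
```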

#17 Tacotron: Towards End-to-End Speech Synthesis

Authors: Yuxuan Wang ; R.J. Skerry-Ryan ; Daisy Stanton ; Yonghui Wu ; Ron J. Weiss ; Navdeep Jaitly ; Zongheng Yang ; Ying Xiao ; Zhifeng Chen ; Samy Bengio ; Quoc Le ; Yannis Agiomyrgiannakis ; Rob Clark ; Rif A. Saurous

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it’s substantially faster than sample-level autoregressive methods.
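
To convey the overall shape of a character-to-spectrogram attention sequence-to-sequence model with a reduction factor (several frames emitted per decoder step), here is a deliberately simplified PyTorch skeleton. It is not the Tacotron architecture: the CBHG modules, pre-nets, and the Griffin-Lim waveform stage are omitted, and every size is an assumption.

```python
# Skeleton of a character-to-spectrogram attention seq2seq with reduction factor r.
import torch
import torch.nn as nn

class Char2Spec(nn.Module):
    def __init__(self, n_chars=64, mel_dim=80, r=3, hidden=256):
        super().__init__()
        self.r, self.mel_dim = r, mel_dim
        self.embed = nn.Embedding(n_chars, hidden)
        self.encoder = nn.GRU(hidden, hidden // 2, batch_first=True, bidirectional=True)
        self.attn_query = nn.Linear(hidden, hidden)
        self.decoder = nn.GRUCell(hidden + mel_dim * r, hidden)
        self.frame_out = nn.Linear(hidden, mel_dim * r)

    def forward(self, chars, n_steps):
        # chars: (1, n_input_chars) character ids
        memory, _ = self.encoder(self.embed(chars))        # (1, N, hidden)
        h = torch.zeros(1, self.decoder.hidden_size)
        prev = torch.zeros(1, self.mel_dim * self.r)       # "go" frames
        outputs = []
        for _ in range(n_steps):
            # content-based attention over the encoder memory
            scores = torch.bmm(memory, self.attn_query(h).unsqueeze(-1)).squeeze(-1)
            context = torch.bmm(torch.softmax(scores, dim=-1).unsqueeze(1),
                                memory).squeeze(1)          # (1, hidden)
            h = self.decoder(torch.cat([context, prev], dim=-1), h)
            frames = self.frame_out(h)                       # r frames at once
            outputs.append(frames.view(1, self.r, self.mel_dim))
            prev = frames                                    # feed back last output
        return torch.cat(outputs, dim=1)                     # (1, n_steps * r, mel_dim)
```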

#18 Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System

Authors: Tim Capes ; Paul Coles ; Alistair Conkie ; Ladan Golipour ; Abie Hadjitarkhani ; Qiong Hu ; Nancy Huddleston ; Melvyn Hunt ; Jiangchuan Li ; Matthias Neeracher ; Kishore Prahallad ; Tuomo Raitio ; Ramya Rasipuram ; Greg Townsend ; Becci Williamson ; David Winarsky ; Zhizheng Wu ; Hepeng Zhang

This paper describes Apple’s hybrid unit selection speech synthesis system, which provides the voices for Siri and is required to deliver naturalness, personality and expressivity. It has been deployed to hundreds of millions of desktop and mobile devices (e.g. iPhone, iPad, Mac) via iOS and macOS in multiple languages. The system follows the classical unit selection framework while using deep learning techniques to boost performance. In particular, deep and recurrent mixture density networks are used to predict the target and concatenation reference distributions for the respective costs during unit selection. In this paper, we present an overview of the run-time TTS engine and the voice building process. We also describe various techniques that enable on-device capability, such as preselection optimization, caching for low latency, and unit pruning for low footprint, as well as techniques that improve the naturalness and expressivity of the voice, such as the use of long units.

#19 An Expanded Taxonomy of Semiotic Classes for Text Normalization

Authors: Daan van Esch ; Richard Sproat

We describe an expanded taxonomy of semiotic classes for text normalization, building upon the work in [1]. We add a large number of categories of non-standard words (NSWs) that we believe a robust real-world text normalization system will have to be able to process. Our new categories are based upon empirical findings encountered while building text normalization systems across many languages, for both speech recognition and speech synthesis purposes. We believe our new taxonomy is useful both for ensuring high coverage when writing manual grammars, as well as for eliciting training data to build machine learning-based text normalization systems.
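
Semiotic classes label non-standard words (cardinal numbers, dates, currency amounts, and so on) so that they can be verbalized correctly. The toy regex tagger below covers only a handful of illustrative classes to convey the idea; the paper's taxonomy is far broader, and the class names and patterns here are not drawn from it.

```python
# Toy regex-based tagger for a few semiotic classes of non-standard words.
import re

PATTERNS = [
    ("DATE",     re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("MONEY",    re.compile(r"^[$€£]\d+(\.\d+)?$")),
    ("TIME",     re.compile(r"^\d{1,2}:\d{2}$")),
    ("CARDINAL", re.compile(r"^\d+$")),
    ("LETTERS",  re.compile(r"^[A-Z]{2,}$")),      # e.g. acronyms read letter by letter
]

def semiotic_class(token):
    for name, pattern in PATTERNS:
        if pattern.match(token):
            return name
    return "PLAIN"

print([(t, semiotic_class(t))
       for t in ["Meet", "at", "3:30", "on", "12/05/2017", "$5", "NASA"]])
```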

#20 Complex-Valued Restricted Boltzmann Machine for Direct Learning of Frequency Spectra

Authors: Toru Nakashika ; Shinji Takaki ; Junichi Yamagishi

In this paper, we propose a new energy-based probabilistic model in which a restricted Boltzmann machine (RBM) is extended to deal with complex-valued visible units. The RBM, which automatically learns the relationships between visible units and hidden units (but has no connections within the visible or hidden layers), has been widely used as a feature extractor, a generator, a classifier, for pre-training of deep neural networks, etc. However, all conventional RBMs have assumed the visible units to be either binary-valued or real-valued, and therefore complex-valued data cannot be fed to the RBM. In various applications, however, complex-valued data arises frequently; examples include complex spectra of speech, fMRI images, wireless signals, and acoustic intensity. For direct learning of such complex-valued data, we define a new model called the “complex-valued RBM (CRBM)”, in which the conditional probability of the complex-valued visible units given the hidden units forms a complex-Gaussian distribution. Another important characteristic of the CRBM is that, unlike the conventional real-valued RBM, it has connections between the real and imaginary parts of each visible unit. Our experiments demonstrated that the proposed CRBM can directly encode complex spectra of speech signals without decoupling the imaginary part or phase from the complex-valued data.